1. INTRODUCTION

Endometriosis is a challenging problem for women of reproductive age. Its phases vary from person to
person depending on the location and severity of the occurrences. The endometrium, the layer lining the inside
of the uterus, is shed at every menstruation. If this tissue spreads to other locations, it leads to endometriosis.
The phases are classified as: i) endometriosis inside the uterus, ii) ovarian endometriosis, iii) peritoneum
endometriosis, and iv) deep infiltrating endometriosis. Endometriosis is identified through scanning
procedures. Its most accurate position is diagnosed through the standard laparoscopic
operating procedure.

The severity of endometriosis affects women both physically and mentally. The external factors 
identified for predicting endometriosis were severe abdominal pain, dysmenorrhea, dyspareunia, abnormal 
uterine bleeding, and breast tenderness. These external factors play a vital role in the prediction that leads to 
the scanning and laparoscopic procedure. The internal factors identified through the laparoscopic procedure 
emphasize the severity of endometriosis. The internal factors include adnexal mass, tissue-like structure, and 
changes in tissue color [1]. 

Machine learning algorithms play a predominant role in diagnosing various types of diseases. These
learning algorithms include random forest, decision tree, support vector machine, logistic regression,
and linear regression [2]. These algorithms predict various types of disease. The decision tree is a
learning algorithm that analyzes the features of a problem by constructing a tree. The tree consists of a
root node at the top, followed by intermediate decision nodes, with leaf nodes in the final layer. The decision
tree identifies the features that are most suitable for analysis [3].

In the decision tree, a concept known as information gain is used for splitting the nodes. Two
measures are used for implementing information gain [4]: i) entropy and ii) Gini index.
Information gain identifies the features that are best suited for classification.


2. RELATED STUDIES 

The decision tree was used to forecast the possibility of cough from the features fractional
exhaled nitric oxide, expiratory flow, and eosinophils. Among all attributes, eosinophils had the major impact
on cough [5]. Breast cancer was predicted using logistic regression and decision tree algorithms. Various
features, including hormonal therapy, tumor size, histological grade, and diagnosis, were considered for
evaluation. The decision tree outperformed logistic regression in forecasting the survival rate of cancer,
and the decision tree (C5.0) yielded a higher accuracy of 86.9% in predicting breast cancer [6].
Birth defects were analyzed using decision tree algorithms. Various factors, including family heredity,
hypertension, diabetes, and nephropathy, were considered. The C5.0 and C4.5 decision tree algorithms were
used for evaluation. The C4.5 algorithm performed better, running in 9.33 seconds with an accuracy of 94.15% [7].

The decision tree was also used for analyzing various diagnostic strategies for endometriosis. Three levels
of evaluation were performed using various factors for predicting endometriosis. “Deep dyspareunia, cyclic
defecation pain, cyclic urinary signs” were considered for evaluation. The decision tree uses second-line and
third-line evaluation for diagnosing endometriosis as deep lesions [8]. The factors influencing heart disease were
predicted using the decision tree. Among multiple attributes, thalassemia, type of chest pain, and major
vessels color were identified as the best features, and the model yielded an accuracy of 85% [9]. The decision
tree helps in evaluating the features by invoking entropy and the Gini index. It predicts
various types of diseases, including cough, thyroid, cancer, heart problems, birth defects, and endometriosis.
The C5.0 algorithm works well in predicting breast cancer, while C4.5 works well in predicting birth defects.
Internal factors play a vital role in predicting the category of endometriosis. The internal factors were identified
as the attributes for endometriosis prediction. They include the size of the tissue, tissue color, mass identified, and
blockages in fallopian tubes. These factors were identified from a retrospective study provided by a gynecologist
and a radiologist [10]. The size of the lesion varies from 1 mm to 6 mm, and the color of the tissue is red, dark
brown, or black. In several cases, adnexal mass was identified along with blockages in fallopian tubes.
A total of 600 records were considered for execution. The list of features is as follows: Size of tissue [1],
Color of tissue [2], Blockages in fallopian tubes [3], and Adnexal mass [4].


3. FEATURE ANALYSIS OF ENDOMETRIOSIS USING DECISION TREE 
The steps involved in the decision tree analysis of endometriosis classification are illustrated in Figure 1.
a) Dataset holding endometriosis influencing internal factors.
b) Splitting of training and testing datasets.
c) Ordinal encoding of training and testing datasets.
d) Decision tree evaluation using Gini and entropy.
e) Performance evaluation.


3.1. Data pre-processing 

The dataset holds around 600 records with 5 attributes. The independent and dependent attributes were
identified. The dataset was divided into a training set of 402 records and a testing set of 198 records,
each containing both independent and dependent attributes. The split data was transformed using an encoder.
Two encoders are frequently used: i) one-hot encoding and ii) ordinal encoding. The ordinal encoding
technique [11] was adopted for the given dataset as it contains categorical values. The ordinal encoding was
applied to the independent attributes of both the training and testing datasets.
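The split and encoding steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the study: the helper `split_records`, the `OrdinalEncoder` class, the category labels, and the alphabetical code order are all assumptions made for the example. The encoder learns its category-to-integer mapping from the training rows only and reuses it for the test rows, so both sets share the same codes.

```python
import random

def split_records(rows, test_size, seed=0):
    """Shuffle a copy of the rows and hold out the last `test_size` rows for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:-test_size], shuffled[-test_size:]

class OrdinalEncoder:
    """Map every category of every column to an integer code (hypothetical helper)."""

    def fit(self, rows):
        # Learn one mapping per column; codes follow the sorted category order.
        self.maps = []
        for col in range(len(rows[0])):
            categories = sorted({row[col] for row in rows})
            self.maps.append({cat: code for code, cat in enumerate(categories)})
        return self

    def transform(self, rows, unknown=-1):
        # Categories never seen during fit are mapped to `unknown`.
        return [[self.maps[c].get(v, unknown) for c, v in enumerate(row)] for row in rows]

# Hypothetical records: (tissue size, tissue colour, tube blockage, adnexal mass)
records = [
    ["small", "red", "no", "no"],
    ["large", "black", "yes", "yes"],
    ["medium", "dark brown", "no", "yes"],
    ["large", "red", "yes", "no"],
    ["small", "black", "no", "no"],
    ["medium", "red", "yes", "yes"],
]
train_rows, test_rows = split_records(records, test_size=2, seed=42)
encoder = OrdinalEncoder().fit(train_rows)
train_encoded = encoder.transform(train_rows)
test_encoded = encoder.transform(test_rows)
print(train_encoded)
print(test_encoded)
```

With 600 records and `test_size=198`, the same helper yields the 402/198 split described above.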


3.2. Decision tree analysis 

The decision tree is a learning technique that performs both classification and prediction. Here the
tree is organized into a root node, decision nodes, and leaf nodes. The root node appears at the top, decision
nodes appear in the middle, and leaf nodes appear in the bottom layer. The decision tree [12] identifies the
features that are helpful for classification and prediction. Two major measures help in analyzing the features:
i) Gini impurity and ii) entropy.
Figure 1. Steps involved in analyzing the features using decision tree


Entropy [13] is a measure used in the decision tree for splitting the data into smaller subsets. Entropy was
used to identify the best feature in the given dataset. The formula for calculating entropy is as follows:

$$Entropy(E) = -\sum_{i=1}^{n} p(i) \log_2 p(i) \tag{1}$$

where p(i) represents the probability of class i among the n classes; a probability always falls in the range of 0 to 1.
Gini index [14] is a measure used for constructing a decision tree. The Gini index determines how well
the decision tree was constructed. Similar to entropy, Gini impurity identifies the feature that is best suited for
classification. For a two-class problem, the Gini value lies between 0 and 0.5. The formula for calculating the
Gini index is as follows:

$$Gini = 1 - \sum_{i=1}^{n} p(i)^2 \tag{2}$$

where p(i) is the probability of class i, which falls between 0 and 1.
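Both measures can be computed directly from the class labels that reach a node. The following is a small sketch using only the Python standard library; the function names and the example label lists are illustrative assumptions, not taken from the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, using the base-2 logarithm."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

balanced = [0, 1, 0, 1]      # two equally likely classes: maximum impurity
pure = [1, 1, 1, 1]          # a single class: zero impurity
skewed = [1] * 7 + [0] * 3   # hypothetical 70/30 class mix

for name, labels in [("balanced", balanced), ("pure", pure), ("skewed", skewed)]:
    print(name, round(entropy(labels), 4), round(gini(labels), 4))
```

For two classes, entropy ranges from 0 to 1 and Gini impurity from 0 to 0.5. Both reach their maximum when the classes are equally likely and drop to 0 when the node is pure.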


Pseudocode:

Start

Read data := internal factors of endometriosis
X, Y := independent and dependent attributes

Split into training data (training_X, training_Y)
       and testing data (testing_X, testing_Y)
Perform encoding: encoder.fit(training_X)
                  train_X := encoder.transform(training_X)
                  test_X := encoder.transform(testing_X)
Model1 := Gini.fit(train_X, training_Y)
Model2 := Entropy.fit(train_X, training_Y)
pred_Gini_Y := Model1.predict(test_X)
pred_Entropy_Y := Model2.predict(test_X)
Construct confusion matrix (testing_Y, pred_Gini_Y)
Construct confusion matrix (testing_Y, pred_Entropy_Y)
Visualize receiver operating characteristic (ROC) and area under the curve (AUC) (Gini and
entropy)
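The model-fitting step of the pseudocode relies on choosing, at each node, the split that reduces impurity the most. The source does not state which library was used, so the following is only a standard-library sketch of that selection for a single node. The function names, the threshold scheme (midpoints between consecutive encoded values), and the six example rows are hypothetical.

```python
import math
from collections import Counter

def impurity(labels, criterion):
    """Impurity of a label list under the 'gini' or 'entropy' criterion."""
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    if criterion == "entropy":
        return sum(-p * math.log2(p) for p in probs)
    return 1.0 - sum(p * p for p in probs)

def best_split(X, y, criterion="gini"):
    """Return (feature index, threshold, impurity decrease) of the best binary split."""
    parent = impurity(y, criterion)
    best = (None, None, 0.0)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds lie midway between consecutive encoded values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            weighted = (len(left) * impurity(left, criterion)
                        + len(right) * impurity(right, criterion)) / len(y)
            gain = parent - weighted
            if gain > best[2]:
                best = (f, t, gain)
    return best

# Hypothetical ordinal-encoded rows: [mass, tube blockage, tissue colour, tissue size]
X = [[0, 0, 0, 0], [0, 1, 1, 0], [1, 0, 2, 1], [1, 1, 0, 2], [0, 0, 1, 2], [1, 1, 2, 1]]
y = [0, 0, 1, 1, 1, 1]
for criterion in ("gini", "entropy"):
    print(criterion, best_split(X, y, criterion))
```

In this toy data the last column separates the two classes perfectly, so both criteria select it. On real data the two criteria can rank candidate splits differently, which is why the two resulting trees are compared.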


3.2.1. Evaluation metrics 

The decision tree model was applied to the dataset by selecting the appropriate features using
entropy and the Gini index. The model was evaluated using several metrics, including specificity,
sensitivity, precision, accuracy, and F1 score, through a classification matrix.
- Precision [15] is the proportion of true positive values among all positive predictions made by the model.

$$precision = \frac{\text{True positive}}{\text{True positive} + \text{False positive}} \tag{3}$$
- Recall (sensitivity) [16] is the proportion of correctly identified positive values to all actual positive values.

$$recall = \frac{\text{True positive}}{\text{Overall positive values}} \tag{4}$$


- Specificity [17] is the ability of the model to correctly predict the negative values in the classification
performed.

$$specificity = \frac{\text{Accurate negative}}{\text{Accurate negative} + \text{Inaccurate positive}} \tag{5}$$
- F1 score [18] is the harmonic mean of precision and sensitivity.

$$F1 = \frac{2 \times precision \times sensitivity}{precision + sensitivity} \tag{6}$$


- AUC-ROC curve: the area under the receiver operating characteristic curve [19] measures the capability of a
model to differentiate between two classes. The ROC curve is plotted as the true positive rate against the false positive rate.
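The ROC curve and its area can be illustrated with a short standard-library sketch. It sweeps a decision threshold over predicted scores, records the false positive rate and true positive rate at each step, and integrates with the trapezoidal rule. The function names, labels, and scores below are hypothetical and serve only to show the computation.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so that ties yield one diagonal segment.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = endometriosis) and predicted scores for six cases
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(y_true, scores)))
```

An AUC of 1.0 means the scores rank every positive case above every negative case, while 0.5 corresponds to random ranking.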


4. RESULTS AND DISCUSSIONS 

The dataset was split randomly into training and testing sets. Ordinal encoding [20] was
applied to both the training dataset and the test dataset. Entropy and Gini impurity were then applied
on the training and testing datasets. The highly influencing features of endometriosis were identified by using
entropy and the Gini index. Figure 2 illustrates the decision tree constructed using entropy. In the figure, X[0]
represents mass, X[1] represents blockages in fallopian tubes, X[2] represents tissue colour, and X[3]
represents the size of the tissue. The features identified using entropy were the size of the tissue, blockage in
fallopian tubes, and mass. Among all features, tissue size was identified as the predominant attribute in
classifying endometriosis, with an entropy value greater than 0.75. Figure 3 illustrates the decision tree
constructed using the Gini index. In the figure, X[0] represents mass, X[1] represents blockages in fallopian tubes,
X[2] represents tissue colour, and X[3] represents the size of tissue [21].

The features identified using the Gini index were size of tissue, blockage in fallopian tubes, mass, and
tissue colour. Among all features, tissue size was identified as the predominant attribute in classifying
endometriosis, with a Gini value of 0.42; the Gini index for tube blockage was 0.236. The factors influencing
endometriosis were analyzed by constructing a decision tree. The performance of the decision tree model was
assessed by several metrics: i) precision, ii) recall, iii) specificity, iv) F1 score, and v) accuracy [22], [23]. Two
criteria, entropy and the Gini index, were considered for analysing the features of endometriosis. Of the
two, the Gini index performs better in terms of various metrics. The confusion matrices obtained are
illustrated in Figure 4, with entropy in Figure 4(a) and the Gini index in Figure 4(b).
Figure 2. Construction of decision tree via entropy
Figure 3. Construction of decision tree via Gini index
Figure 4. Confusion matrix (a) confusion matrix using entropy and (b) confusion matrix using Gini index


For entropy, the true positive, true negative, false positive, and false negative counts were 68, 98, 8,
and 24, respectively. Similarly, for the Gini index the true positive count was 66, the true negative count
was 102, the false positive count was 4, and the false negative count was 26. Based on the classification
matrix, other metrics were evaluated and are illustrated in Table 1 and Figure 5.
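The metrics of equations (3) to (6) and the test accuracy can be recomputed from these confusion-matrix counts. The sketch below uses only the Python standard library and assumes that the four entropy counts are listed in the same order as the Gini counts (true positive, true negative, false positive, false negative). The function name and dictionary layout are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, specificity, F1 score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }

# Counts reported for the 198 test records
counts = {
    "entropy": dict(tp=68, tn=98, fp=8, fn=24),
    "gini": dict(tp=66, tn=102, fp=4, fn=26),
}
for name, c in counts.items():
    metrics = classification_metrics(**c)
    print(name, {key: round(100 * value, 2) for key, value in metrics.items()})
```

Under that assumption the recomputed percentages agree with the reported values; small differences in the last reported digits can arise from rounding or truncation.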

The precision, recall, specificity, and F1 score for entropy were 89.47, 73.91, 92.45, and 80.94,
respectively. Similarly, for the Gini index, the precision was 94.2, the recall was 71.73, the specificity was
96.22, and the F1 score was 81.44. The Gini index performs better than entropy on precision, specificity, and
F1 score, while entropy has the slightly higher recall. The next metric evaluated was accuracy [24], computed
for both training data and test data. With entropy, the model obtained a training accuracy of 80.85% and a
testing accuracy of 83.84%. With the Gini index, the model obtained training and testing accuracies of 84.08%
and 84.85%, respectively. The comparison is illustrated in Figure 6 and Table 2.


Table 1. Performance metrics comparison of entropy and Gini
           Precision (%)   Recall (%)   Specificity (%)   F1 score (%)
Entropy    89.47           73.91        92.45             80.94
Gini       94.2            71.73        96.22             81.44
Figure 6. Accuracy score for entropy and Gini


Table 2. Accuracy score for entropy and Gini
                     Model with entropy   Model with Gini
Training score (%)   80.85                84.08
Testing score (%)    83.84                84.85


Another metric, the area under the curve [25], was evaluated for the given model for both entropy
and the Gini index. The ROC curve was constructed by plotting sensitivity against 1 - specificity. The AUC value
obtained for the prediction of endometriosis was 0.87 for entropy and 0.89 for the Gini index, as
illustrated in Figure 7.
Figure 7. Area under curve for entropy and Gini
5. CONCLUSION

Endometriosis is now considered a fairly common disease, affecting 15% of women worldwide.
Its impact on affected women is severe. From laparoscopic surgery,
several symptoms were identified, including mass, extra tissue size, extra tissue colour, and blockages in
fallopian tubes. The decision tree algorithm evaluates the best features using the Gini index and entropy. The
best features, size of tissue and mass, were more accurately predicted using the Gini index, with an accuracy of
84.85%, precision of 94.2%, recall of 71.73%, specificity of 96.22%, and F1 score of 81.44%. The
area under the curve obtained was 0.87 for entropy and 0.89 for the Gini index. The Gini index
performs well in identifying the most suitable features of endometriosis.